This report describes an analysis of raw counts of Nanostring GeoMx data provided by Diana Piol, and analyzed using the DESeq2 Bioconductor package. This report analyzes the inferred transcriptional changes between experimental groups and produces statistical diagnostic plots and lists of differentially expressed (DE) genes for each experimental comparison, and saves these resulting DE gene lists as Excel files.
## load libraries
library("dplyr")
library("tidyr")
library("tibble")
library("ggplot2")
library("ggrepel")
library("DESeq2")
library("RColorBrewer")
library("pheatmap")
library("readxl")
library("readr")
library("writexl")
library("janitor")
library("gridExtra")
library("biomaRt")
library("DT")
## set project variables
proj <- "NanoString"
sel_comp <- "CTRL"
tissue <- "sciatic_nerve"
sel_comp1 <- "Chatpos_vs_Chatneg"
model <- "segment"
We load the complete Nanostring count matrix of all samples and metadata describing the conditions of these samples. The metadata and matrix counts from NanoString data are subset to contain only CTRL sciatic nerve samples. The metadata (or “colData”) for the samples being compared is displayed in an HTML table shown below:
The following conditions being compared within CTRL samples:
Chatpos vs Chatneg (across only
CTRL)Note: all comparisons are based relative to the first comparison listed, which is used as the normative control
For this project, we are using the following model:
~ segment
The standard differential expression analysis steps are wrapped into
a single function, DESeq. The estimation steps performed by
this function are described in the DESeq2
vignette, and in the manual page for ?DESeq and in the
Methods section of the DESeq2 publication (Love, Huber, and Anders
2014).
dds <- DESeqDataSetFromMatrix(countData = cts %>% as.matrix(),
colData = meta_data,
design = ~ segment)
cat("Number of genes in dds object *before* filtering out rows with zero counts \n")
#> Number of genes in dds object *before* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 19959
For QC reasons, genes with counts of all 0 for all samples (i.e. rows
in the count matrix with no count measurements) are removed before
running DESeq. This is because for rows with all zero
counts no variance can be modeled.
keep <- rowSums(counts(dds)) > 0
dds <- DESeq(dds[keep, ])
cat("Number of genes in dds object *after* filtering out rows with zero counts \n")
#> Number of genes in dds object *after* filtering out rows with zero counts
dim(assay(dds))[1]
#> [1] 19959
Results tables are generated using the function results,
which extracts a results table with log2 fold changes, p-values and
adjusted p-values. With no additional arguments to results, the log2
fold change and Wald test p-value will be for the first
variable in the design formula, the experiment group will be the
last variable.
res1 <- results(dds, contrast = c("segment", "ChATpos", "ChATneg"), alpha = 0.05)
PCA is a method of visually identifying the similarity or difference
between samples. PCA rotates the data cloud onto an orthogonal basis
determined by the dimensions of maximal variance. The first two
Principal Components (PCs) usually hold the majority of the variance of
the data. The following plots show the variance stabilized transformed
count matrix samples projected onto the two largest Principal Components
(i.e. PC1 and PC2). DESeq2 recommends two types of PCA
stabilizations that are performed prior to creating the PCA: * vst
“variance stabilizing transformation” * rld “regularized
log transformation”
## perform variance stabilizing transformation and PCA and plot with ggplot2
vst <- DESeq2::vst(dds)
pcaData <- plotPCA(vst, intgroup = c("segment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>%
ggplot(aes(PC1, PC2, color = segment)) +
geom_point(size = 3) +
geom_text_repel(aes(label = colnames(vst)), force = 5,
arrow = arrow(length = unit(0.03, "npc"),
type = "closed", ends = "first")) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
coord_fixed()
## perform regularized log transformation and PCA and plot with ggplot2
rld <- rlogTransformation(dds)
pcaData <- plotPCA(rld, intgroup = c("segment"), returnData = TRUE)
percentVar <- round(100 * attr(pcaData, "percentVar"))
pcaData %>%
ggplot(aes(PC1, PC2, color = segment)) +
geom_point(size = 3) +
geom_text_repel(aes(label = colnames(rld)), force = 5,
arrow = arrow(length = unit(0.03, "npc"),
type = "closed", ends = "first")) +
xlab(paste0("PC1: ",percentVar[1],"% variance")) +
ylab(paste0("PC2: ",percentVar[2],"% variance")) +
theme(axis.ticks.x = element_blank(), axis.text.x = element_blank(),
axis.ticks.y = element_blank(), axis.text.y = element_blank()) +
coord_fixed()
Size factors are a method of normalizing used by the DESeq function to normalize the data in terms of sequencing depth. Size factor is the median ratio of the sample over a pseudosample: for each gene, the geometric mean of all samples. Size factors account for differences in sequencing depth are typically centered around 1 (indicating comparable sequencing depth).
dds$sizeFactor %>%
as.data.frame() %>%
tibble::rownames_to_column(var = "sample") %>%
dplyr::rename(size_factors = names(.)[2]) %>%
dplyr::left_join(meta_data, by = "sample") %>%
ggplot(aes(x = sample, y = size_factors, fill = segment)) +
geom_bar(stat = "identity") +
theme(axis.text.x = element_text(angle = 90, vjust = 0.5, hjust = 1)) +
geom_hline(yintercept = mean(dds$sizeFactor), color = "red", linetype = "dashed")
DESeq2 estimates gene dispersion using an algorithm that first generates gene-wise maximum likelihood estimates (MLEs) that are obtained using only the respective gene’s data (black dots). Then, a curve (red) is fit to the MLEs to capture the overall trend of dispersion-mean dependence. This fit is used as a prior mean for a second estimation round, which results in the final maximum a priori (MAP) estimates of dispersion. This results in a “shrinkage” of the noisy gene-wise estimates toward the consensus represented by the red line. The black points circled in blue are detected as dispersion outliers and not shrunk toward the prior (shrinkage would follow the dotted line). A more in-depth theoretical explanation of the DESeq2 algorithm can be found here
plotDispEsts(dds, main = "DESeq2 Dispersion plot")
DESeq2 performs independent filtering by default using the mean of normalized counts as a filter statistic. A threshold on the filter statistic (first value) is found which optimizes the number of adjusted p-values lower than significance level alpha. The adjusted p-values for the genes which do not pass the filter threshold are set to NA. The results also returns the mean of normalized counts (second value).
cat("Filter thresh. val. and mean of norm. counts", gsub("_", " ", sel_comp1), " \n")
#> Filter thresh. val. and mean of norm. counts Chatpos vs Chatneg
metadata(res1)$filterThreshold
#> 95%
#> 7.162842
The filterThreshold returns the threshold chosen (vertical line in the plots below) by the DESeq2 analysis of the lowest quantile of the filter for which the number of sample rejections is within 1 residual standard deviation to the peak of a curve fit to the number of rejections over the filter quantiles. The following diagnostic plot shows the number of rejected samples (y-axis) plotted against quantiles of filter (x-axis).
par(mfrow = c(1, 1))
plot(metadata(res1)$filterNumRej, type = "b", main = gsub("_", " ", sel_comp1),
xlab = "Quantiles of filter", ylab = "Number of rejections")
lines(metadata(res1)$lo.fit, col = "red")
abline(v = metadata(res1)$filterTheta)
The following plot shows the number of frequency of counts (y-axis) against p-values between 0 and 1 (x-axis).
par(mfrow = c(1, 1))
hist(res1$pvalue, col = "lavender", xlab = "p-values", main = gsub("_", " ", sel_comp1))
print(gsub("_", " ", sel_comp1))
#> [1] "Chatpos vs Chatneg"
print(summary(res1))
#>
#> out of 19959 with nonzero total read count
#> adjusted p-value < 0.05
#> LFC > 0 (up) : 0, 0%
#> LFC < 0 (down) : 16, 0.08%
#> outliers [1] : 1, 0.005%
#> low counts [2] : 18961, 95%
#> (mean count < 7)
#> [1] see 'cooksCutoff' argument of ?results
#> [2] see 'independentFiltering' argument of ?results
#>
#> NULL
We annotate the DE results with mouse Ensembl and
Entrez IDs by accessing the BioMart
database.
## annotate DE results with Biomart database
# listAttributes(mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl")))
# getBM(attributes = c("ensembl_gene_id", "entrezgene_id", "external_gene_name"),
# mart = useDataset("mmusculus_gene_ensembl", useMart("ensembl"))) %>%
# dplyr::rename(ensembl_id = ensembl_gene_id,
# entrez_id = entrezgene_id,
# genes = external_gene_name) %>%
# dplyr::filter(stringr::str_length(genes) > 1,
# !duplicated(ensembl_id)) -> mouse_biomart
# saveRDS(object = mouse_biomart, file = "./data/mouse_biomart.rds")
mouse_biomart <- readRDS(file = "../202311_piol/data/mouse_biomart.rds")
The annotated DE results are arranged by
p-adjusted value. Normalized counts from the are added to
the right of the DE statistics.
## this code chunk extracts the Cook's Distance
as.data.frame(mcols(dds)$maxCooks) %>%
dplyr::rename(cooks_stat = names(.)[1]) %>%
dplyr::arrange(desc(cooks_stat)) %>%
tibble::rownames_to_column(var = "genes") -> top_cooks
merge(as.data.frame(res1), as.data.frame(counts(dds, normalized = TRUE)),
by = "row.names", sort = FALSE) %>%
dplyr::rename(genes = names(.)[1]) %>%
dplyr::left_join(mouse_biomart, by = "genes") %>%
dplyr::left_join(top_cooks, by = "genes") %>%
dplyr::arrange(padj) %>%
dplyr::select(genes, contains("_id"), cooks_stat, everything()) %>%
as.data.frame() -> res_df1
A MA plot illustrates log-fold expression change between two groups of samples, created by transforming and the data onto two scales: M (the log of the ratio of level counts for each gene between two samples) and A (the average level counts for each gene across the two samples) scales. MA plots demonstrates the difference between samples in terms of signal intensities of read counts. In this type of plot, genes with similar expression levels in two samples will appear around the horizontal line y = 0 (red line). The following MA plot illustrates log-fold expression change for each comparison after the DESeq2 analysis. Significant genes (P < 0.05) are highlighted in red.
results1 <- as.data.frame(res_df1)
results1[is.na(results1)] <- 0.99 # change NA results to 0.99 for correct MAplot
results1 %>%
ggplot(aes(x = baseMean, y = log2FoldChange)) +
geom_point(aes(colour = padj < 0.05), size = 0.5) +
scale_colour_manual(name = 'padj < 0.05',
values = setNames(c('red','black'), c(TRUE, FALSE))) +
scale_x_continuous(trans = "log10", limits = c(0.1, 300000)) +
geom_smooth(colour = "red") +
geom_abline(slope = 0, intercept = 0, colour = "blue") +
theme(plot.title = element_text(hjust = 0.5)) +
xlab("baseMean (A)") + ylab("log2FoldChange (M)") +
ggtitle(paste0("MA plot for ", gsub("_", " ", sel_comp1)))
A volcano plot is a type of scatter plot that is used to quickly identify genes that display large magnitude changes that are also statistically significant. A volcano plot is constructed by plotting the negative log of the p-value on the y-axis. This results in data points with low p-values appearing toward the top of the plot. The x-axis is the log of the fold change between the two experimental conditions. The log of the fold change is used so that changes in both directions appear equidistant from the center. Plotting points in this way results in two regions of interest in the plot: those points that are found toward the top of the plot that are far to either the left- or right-hand sides. These represent values that display large magnitude fold changes (on the left or right of center) as well as high statistical significance (toward the top). The following Volcano plot shows log of the fold change and negative log of the p-values for each comparison. Significant genes (P < 0.05) with log2 fold change (> 1) are highlighted in red.
res_df1 %>%
dplyr::filter(!is.na(pvalue)) %>%
dplyr::mutate(threshold = as.factor(abs(log2FoldChange) > 1 & pvalue < 0.05),
sig_group = as.factor(dplyr::case_when(
log2FoldChange > 1 & pvalue < 0.05 ~ "Control",
log2FoldChange < -1 & pvalue < 0.05 ~ "FUS",
TRUE ~ "not significant"))) %>%
ggplot(aes(x = log2FoldChange, y = -log10(pvalue), color = sig_group)) +
geom_point(alpha = 0.75, size = 0.75) +
geom_hline(yintercept = -log10(0.05), color = "red", linetype = "dashed") +
geom_vline(xintercept = -1, color = "red", linetype = "dashed") +
geom_vline(xintercept = 1, color = "red", linetype = "dashed") +
xlab("log2 fold change") +
ylab("-log10 p-value") +
theme(plot.title = element_text(hjust = 0.5)) +
scale_color_manual(values = c("limegreen", "magenta", "gray3")) +
ggtitle(paste0("Volcano plot for ", gsub("_", " ", sel_comp1)))
Count plots are created for the top 5 DE genes for each comparison.
res_df1 %>%
dplyr::slice(1:5) %>%
dplyr::select(genes, tidyr::contains("DSP")) %>%
tidyr::gather(key = "sample", value = "count", -genes) %>%
dplyr::left_join(meta_data, by = "sample") %>%
ggplot(aes(x = sample, y = log10(count), color = segment)) +
geom_point() +
theme(axis.text.x = element_blank()) +
ggtitle(paste0("Gene counts of top 5 genes for ", gsub("_", " ", sel_comp1),
" ", tissue)) +
facet_grid(~genes)
Below are three html tables displaying the results the DE comparisons. These tables are interactive and be queried for specific proteins or sorted by column.
Note on p-values set to NA: some values in the results table can be set to NA for one of the following reasons: * If within a row, all samples have zero counts, the baseMean column will be zero, and the log2 fold change estimates, p-value and adjusted p-value will all be set to NA. * If a row contains a sample with an extreme count outlier then the p-value and adjusted p-value will be set to NA. These outlier counts are detected by Cook’s distance. * If a row is filtered by automatic independent filtering, for having a low mean normalized count, then only the adjusted p-value will be set to NA.
res_df1 %>%
dplyr::select(-c(contains("_id"), cooks_stat)) %>%
dplyr::mutate_at(vars(baseMean:stat), round, 3) %>%
dplyr::mutate_at(vars(matches("DSP")), round, 3) %>%
DT::datatable()
Results are saved as .csv and .rds files which are unfiltered.
These Excel files have the following columns:
genes = Hugo gene symbol (official gene name)entrez_id = Entrez gene IDensembl_id = Ensembl gene IDnon_zero_de = number of non-zero values across gene
rowsperc_non_zero_de = percent non-zero values across gene
rows out of all samplesbaseMean = DESeq2 average mean count per genelog2FoldChange = log2 normalized fold change between
two DE conditionslfcSE = standard Normal distribution to generate a
two-tailed p-valuestat = the difference in deviance between the reduced
model and the full model, which is compared to a chi-squared
distribution to generate a pvaluepvalue = p-valuepadj = Benjamini-Hochberg FDR p-adjusted valueNote: Cook’s Distance could not be calculated because we have genotype groups of less than 4 samples.
## populdate list of DE results
de_list <- list(res_df1)
## name list headers to become sheet names
names(de_list) <- c(paste0(gsub("_", " ", sel_comp1)))
## save as Excel and RDS
saveRDS(object = de_list, file = paste0(
"./data/", proj, "_", tissue, "_", sel_comp, "_", model, "_DE_res_", Sys.Date(), ".rds"))
writexl::write_xlsx(x = de_list, path = paste0(
"./data/", proj, "_", tissue, "_", sel_comp, "_", model, "_DE_res_", Sys.Date(), ".xlsx"))
Session Info
sessionInfo()
#> R version 4.4.0 (2024-04-24)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.5
#>
#> Matrix products: default
#> BLAS: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib; LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Brussels
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats4 stats graphics grDevices utils datasets methods
#> [8] base
#>
#> other attached packages:
#> [1] DT_0.33 biomaRt_2.60.0
#> [3] gridExtra_2.3 janitor_2.2.0
#> [5] writexl_1.5.0 readr_2.1.5
#> [7] readxl_1.4.3 pheatmap_1.0.12
#> [9] RColorBrewer_1.1-3 DESeq2_1.44.0
#> [11] SummarizedExperiment_1.34.0 Biobase_2.64.0
#> [13] MatrixGenerics_1.16.0 matrixStats_1.3.0
#> [15] GenomicRanges_1.56.0 GenomeInfoDb_1.40.0
#> [17] IRanges_2.38.0 S4Vectors_0.42.0
#> [19] BiocGenerics_0.50.0 ggrepel_0.9.5
#> [21] ggplot2_3.5.1 tibble_3.2.1
#> [23] tidyr_1.3.1 dplyr_1.1.4
#>
#> loaded via a namespace (and not attached):
#> [1] DBI_1.2.2 httr2_1.0.1 rlang_1.1.3
#> [4] magrittr_2.0.3 snakecase_0.11.1 compiler_4.4.0
#> [7] RSQLite_2.3.7 mgcv_1.9-1 png_0.1-8
#> [10] vctrs_0.6.5 stringr_1.5.1 pkgconfig_2.0.3
#> [13] crayon_1.5.2 fastmap_1.2.0 dbplyr_2.5.0
#> [16] XVector_0.44.0 labeling_0.4.3 utf8_1.2.4
#> [19] rmarkdown_2.27 tzdb_0.4.0 UCSC.utils_1.0.0
#> [22] purrr_1.0.2 bit_4.0.5 xfun_0.44
#> [25] zlibbioc_1.50.0 cachem_1.1.0 jsonlite_1.8.8
#> [28] progress_1.2.3 blob_1.2.4 highr_0.11
#> [31] DelayedArray_0.30.0 BiocParallel_1.38.0 parallel_4.4.0
#> [34] prettyunits_1.2.0 R6_2.5.1 bslib_0.7.0
#> [37] stringi_1.8.4 lubridate_1.9.3 jquerylib_0.1.4
#> [40] cellranger_1.1.0 Rcpp_1.0.12 knitr_1.46
#> [43] splines_4.4.0 Matrix_1.7-0 timechange_0.3.0
#> [46] tidyselect_1.2.1 rstudioapi_0.16.0 abind_1.4-5
#> [49] yaml_2.3.8 codetools_0.2-20 curl_5.2.1
#> [52] lattice_0.22-6 withr_3.0.0 KEGGREST_1.44.0
#> [55] evaluate_0.23 BiocFileCache_2.12.0 xml2_1.3.6
#> [58] Biostrings_2.72.0 filelock_1.0.3 pillar_1.9.0
#> [61] generics_0.1.3 hms_1.1.3 munsell_0.5.1
#> [64] scales_1.3.0 glue_1.7.0 tools_4.4.0
#> [67] locfit_1.5-9.9 grid_4.4.0 crosstalk_1.2.1
#> [70] AnnotationDbi_1.66.0 colorspace_2.1-0 nlme_3.1-164
#> [73] GenomeInfoDbData_1.2.12 cli_3.6.2 rappdirs_0.3.3
#> [76] fansi_1.0.6 S4Arrays_1.4.0 gtable_0.3.5
#> [79] sass_0.4.9 digest_0.6.35 SparseArray_1.4.1
#> [82] farver_2.1.2 htmlwidgets_1.6.4 memoise_2.0.1
#> [85] htmltools_0.5.8.1 lifecycle_1.0.4 httr_1.4.7
#> [88] bit64_4.0.5
References:
Love, M.I., Huber, W., Anders, S. (2014) “Moderated estimation of fold change and dispersion for RNA-seq data with DESeq2.” Genome Biology, 15:550. 10.1186/s13059-014-0550-8
Anders, Simon, and Wolfgang Huber. 2010. “Differential Expression Analysis for Sequence Count Data.” Genome Biology 11:R106. http://genomebiology.com/2010/11/10/R106.
Ritchie ME, Phipson B, Wu D, Hu Y, Law CW, Shi W, Smyth GK (2015). “limma powers differential expression analyses for RNA-sequencing and microarray studies.” Nucleic Acids Research, 43(7), e47. doi: 10.1093/nar/gkv007.